Week 2 - Data Manipulation: dplyr

Emorie D Beck

Data Manipulation in dplyr

Core Functions

  1. %>%: The pipe. Read as “and then.”
  2. filter(): Pick observations (rows) by their values.
  3. select(): Pick variables (columns) by their names.
  4. arrange(): Reorder the rows.
  5. group_by(): Implicitly split the data set by grouping by names (columns).
  6. mutate(): Create new variables with functions of existing variables.
  7. summarize() / summarise(): Collapse many values down to a single summary.

Core Functions

  1. %>%
  2. filter()
  3. select()
  4. arrange()
  5. group_by()
  6. mutate()
  7. summarize()

Although each of these functions are powerful alone, they are incredibly powerful in conjunction with one another. So below, I’ll briefly introduce each function, then link them all together using an example of basic data cleaning and summary.

1. %>%

  • The pipe %>% is wonderful. It makes coding intuitive. Often in coding, you need to use so-called nested functions. For example, you might want to round a number after taking the square of 43.
sqrt(43)
[1] 6.557439
round(sqrt(43), 2)
[1] 6.56

1. %>%

The issue with this comes whenever we need to do a series of operations on a data set or other type of object. In such cases, if we run it in a single call, then we have to start in the middle and read our way out.

round(sqrt(43/2), 2)
[1] 4.64

1. %>%

The pipe solves this by allowing you to read from left to right (or top to bottom). The easiest way to think of it is that each call of %>% reads and operates as “and then.” So with the rounded square root of 43, for example:

sqrt(43) %>%
  round(2)
[1] 6.56

2. filter()

Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don’t want to include.

data(bfi) # grab the bfi data from the psych package
bfi <- bfi %>% as_tibble()
head(bfi)
# A tibble: 6 × 28
     A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2    E3
  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1     2     4     3     4     4     2     3     3     4     4     3     3     3
2     2     4     5     2     5     5     4     4     3     4     1     1     6
3     5     4     5     4     4     4     5     4     2     5     2     4     4
4     4     4     6     5     5     4     4     3     5     5     5     3     4
5     2     3     3     4     5     4     4     5     3     2     2     2     5
6     6     6     5     6     5     6     6     6     1     3     2     1     6
# ℹ 15 more variables: E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>,
#   N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>,
#   gender <int>, education <int>, age <int>

2. filter()

Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don’t want to include.

summary(bfi$age) # get age descriptives
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.00   20.00   26.00   28.78   35.00   86.00 

2. filter()

Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don’t want to include.

bfi2 <- bfi %>% # see a pipe!
  filter(age <= 18) # filter to age up to 18

summary(bfi2$age) # summary of the new data 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    3.0    16.0    17.0    16.3    18.0    18.0 

But this isn’t quite right. We still have folks below 12. But, the beauty of filter() is that you can do sequence of OR and AND statements when there is more than one condition, such as up to 18 AND at least 12.

2. filter()

Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don’t want to include.

bfi2 <- bfi %>%
  filter(age <= 18 & age >= 12) # filter to age up to 18 and at least 12

summary(bfi2$age) # summary of the new data 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   12.0    16.0    17.0    16.4    18.0    18.0 

Got it!

2. filter()

  • But filter works for more use cases than just conditional
    • <, >, <=, and >=
  • It can also be used for cases where we want a single values to match cases with text.
  • To do that, let’s convert one of the variables in the bfi data frame to a string.
  • So let’s change gender (1 = male, 2 = female) to text (we’ll get into factors later).
bfi$education <- plyr::mapvalues(bfi$education, 1:5, c("Below HS", "HS", "Some College", "College", "Higher Degree"))

2. filter()

Now let’s try a few things:

1. Create a data set with only individuals with some college (==).

bfi2 <- bfi %>% 
  filter(education == "Some College")
unique(bfi2$education)
[1] "Some College"

2. filter()

Now let’s try a few things:

2. Create a data set with only people age 18 (==).

bfi2 <- bfi %>%
  filter(age == 18)
summary(bfi2$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     18      18      18      18      18      18 

2. filter()

Now let’s try a few things:

3. Create a data set with individuals with some college or above (%in%).

bfi2 <- bfi %>%
  filter(education %in% c("Some College", "College", "Higher Degree"))
unique(bfi2$education)
[1] "Some College"  "Higher Degree" "College"      

%in% is great. It compares a column to a vector rather than just a single value, you can compare it to several

bfi2 <- bfi %>%
  filter(age %in% 12:18)
summary(bfi2$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   12.0    16.0    17.0    16.4    18.0    18.0 

3. select()

  • If filter() is for pulling certain observations (rows), then select() is for pulling certain variables (columns).
  • it’s good practice to remove these columns to stop your environment from becoming cluttered and eating up your RAM.

3. select()

  • In our bfi data, most of these have been pre-removed, so instead, we’ll imagine we don’t want to use any indicators of Agreeableness (A1-A5) and that we aren’t interested in gender.
  • With select(), there are few ways choose variables. We can bare quote name the ones we want to keep, bare quote names we want to remove, or use any of a number of select() helper functions.

3. select():

A. Bare quote columns we want to keep:

bfi %>%
  select(C1, C2, C3, C4, C5) %>%
  print(n = 6)
# A tibble: 2,800 × 5
     C1    C2    C3    C4    C5
  <int> <int> <int> <int> <int>
1     2     3     3     4     4
2     5     4     4     3     4
3     4     5     4     2     5
4     4     4     3     5     5
5     4     4     5     3     2
6     6     6     6     1     3
# ℹ 2,794 more rows
bfi %>%
  select(C1:C5) %>%
  print(n = 6)
# A tibble: 2,800 × 5
     C1    C2    C3    C4    C5
  <int> <int> <int> <int> <int>
1     2     3     3     4     4
2     5     4     4     3     4
3     4     5     4     2     5
4     4     4     3     5     5
5     4     4     5     3     2
6     6     6     6     1     3
# ℹ 2,794 more rows

3. select():

B. Bare quote columns we don’t want to keep:

bfi %>% 
  select(-(A1:A5), -gender) %>% # Note the `()` around the columns
  print(n = 6)
# A tibble: 2,800 × 22
     C1    C2    C3    C4    C5    E1    E2    E3    E4    E5    N1    N2    N3
  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1     2     3     3     4     4     3     3     3     4     4     3     4     2
2     5     4     4     3     4     1     1     6     4     3     3     3     3
3     4     5     4     2     5     2     4     4     4     5     4     5     4
4     4     4     3     5     5     5     3     4     4     4     2     5     2
5     4     4     5     3     2     2     2     5     4     5     2     3     4
6     6     6     6     1     3     2     1     6     5     6     3     5     2
# ℹ 2,794 more rows
# ℹ 9 more variables: N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
#   O4 <int>, O5 <int>, education <chr>, age <int>

3. select():

C. Add or remove using select() helper functions.

  • starts_with()
  • ends_with()
  • contains()
  • matches()
  • num_range()
  • one_of()
  • all_of()
bfi %>%
  select(starts_with("C"))
# A tibble: 2,800 × 5
      C1    C2    C3    C4    C5
   <int> <int> <int> <int> <int>
 1     2     3     3     4     4
 2     5     4     4     3     4
 3     4     5     4     2     5
 4     4     4     3     5     5
 5     4     4     5     3     2
 6     6     6     6     1     3
 7     5     4     4     2     3
 8     3     2     4     2     4
 9     6     6     3     4     5
10     6     5     6     2     1
# ℹ 2,790 more rows

4. arrange()

  • Sometimes, either in order to get a better sense of our data or in order to well, order our data, we want to sort it
  • Although there is a base R sort() function, the arrange() function is tidyverse version that plays nicely with other tidyverse functions.

4. arrange()

So in our previous examples, we could also arrange() our data by age or education, rather than simply filtering. (Or as we’ll see later, we can do both!)

# sort by age
bfi %>% 
  select(gender:age) %>%
  arrange(age) %>% 
  print(n = 6)
# A tibble: 2,800 × 3
  gender education       age
   <int> <chr>         <int>
1      1 Higher Degree     3
2      2 <NA>              9
3      2 Some College     11
4      2 <NA>             11
5      2 <NA>             11
6      2 <NA>             12
# ℹ 2,794 more rows
# sort by education
bfi %>%
  select(gender:age) %>%
  arrange(education) %>%
  print(n = 6)
# A tibble: 2,800 × 3
  gender education   age
   <int> <chr>     <int>
1      1 Below HS     19
2      1 Below HS     21
3      1 Below HS     17
4      1 Below HS     18
5      1 Below HS     18
6      2 Below HS     18
# ℹ 2,794 more rows

4. arrange()

We can also arrange by multiple columns, like if we wanted to sort by gender then education:

bfi %>%
  select(gender:age) %>%
  arrange(gender, education) %>% 
  print(n = 6)
# A tibble: 2,800 × 3
  gender education   age
   <int> <chr>     <int>
1      1 Below HS     19
2      1 Below HS     21
3      1 Below HS     17
4      1 Below HS     18
5      1 Below HS     18
6      1 Below HS     32
# ℹ 2,794 more rows

Bringing it all together: Split-Apply-Combine

Bringing it all together: Split-Apply-Combine

  • Much of the power of dplyr functions lay in the split-apply-combine method

  • A given set of of data are:

    • split into smaller chunks
    • then a function or series of functions are applied to each chunk
    • and then the chunks are combined back together

5. group_by()

  • The group_by() function is the “split” of the method
  • It basically implicitly breaks the data set into chunks by whatever bare quoted column(s)/variable(s) are supplied as arguments.

5. group_by()

So imagine that we wanted to group_by() education levels to get average ages at each level

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  print(n = 6)
# A tibble: 2,800 × 8
# Groups:   education [6]
     C1    C2    C3    C4    C5   age gender education   
  <int> <int> <int> <int> <int> <int>  <int> <chr>       
1     2     3     3     4     4    16      1 <NA>        
2     5     4     4     3     4    18      2 <NA>        
3     4     5     4     2     5    17      2 <NA>        
4     4     4     3     5     5    17      2 <NA>        
5     4     4     5     3     2    17      1 <NA>        
6     6     6     6     1     3    21      2 Some College
# ℹ 2,794 more rows

5. group_by()

  • Hadley’s first law of data cleaning: “What is split, must be combined”
  • This is super easy with the ungroup() function:
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  ungroup() %>%
  print(n = 6)
# A tibble: 2,800 × 8
     C1    C2    C3    C4    C5   age gender education   
  <int> <int> <int> <int> <int> <int>  <int> <chr>       
1     2     3     3     4     4    16      1 <NA>        
2     5     4     4     3     4    18      2 <NA>        
3     4     5     4     2     5    17      2 <NA>        
4     4     4     3     5     5    17      2 <NA>        
5     4     4     5     3     2    17      1 <NA>        
6     6     6     6     1     3    21      2 Some College
# ℹ 2,794 more rows

5. group_by()

Multiple group_by() calls overwrites previous calls:

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  group_by(gender, age) %>%
  print(n = 6)
# A tibble: 2,800 × 8
# Groups:   gender, age [115]
     C1    C2    C3    C4    C5   age gender education   
  <int> <int> <int> <int> <int> <int>  <int> <chr>       
1     2     3     3     4     4    16      1 <NA>        
2     5     4     4     3     4    18      2 <NA>        
3     4     5     4     2     5    17      2 <NA>        
4     4     4     3     5     5    17      2 <NA>        
5     4     4     5     3     2    17      1 <NA>        
6     6     6     6     1     3    21      2 Some College
# ℹ 2,794 more rows

6. mutate()

  • mutate() is one of your “apply” functions
  • When you use mutate(), the resulting data frame will have the same number of rows you started with
  • You are directly mutating the existing data frame, either modifying existing columns or creating new ones

6. mutate()

To demonstrate, let’s add a column that indicated average age levels within each age group

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  group_by(education) %>% 
  mutate(age_by_edu = mean(age, na.rm = T)) %>%
  print(n = 6)
# A tibble: 2,800 × 9
# Groups:   education [6]
     C1    C2    C3    C4    C5   age gender education age_by_edu
  <int> <int> <int> <int> <int> <int>  <int> <chr>          <dbl>
1     6     6     3     4     5    19      1 Below HS        25.1
2     4     3     5     3     2    21      1 Below HS        25.1
3     5     5     5     2     2    17      1 Below HS        25.1
4     5     5     4     1     1    18      1 Below HS        25.1
5     4     5     4     3     3    18      1 Below HS        25.1
6     3     2     3     4     6    18      2 Below HS        25.1
# ℹ 2,794 more rows

6. mutate()

mutate() is also super useful even when you aren’t grouping

We can create a new category

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  mutate(gender_cat = plyr::mapvalues(gender, c(1,2), c("Male", "Female")))
# A tibble: 2,800 × 9
      C1    C2    C3    C4    C5   age gender education    gender_cat
   <int> <int> <int> <int> <int> <int>  <int> <chr>        <chr>     
 1     2     3     3     4     4    16      1 <NA>         Male      
 2     5     4     4     3     4    18      2 <NA>         Female    
 3     4     5     4     2     5    17      2 <NA>         Female    
 4     4     4     3     5     5    17      2 <NA>         Female    
 5     4     4     5     3     2    17      1 <NA>         Male      
 6     6     6     6     1     3    21      2 Some College Female    
 7     5     4     4     2     3    18      1 <NA>         Male      
 8     3     2     4     2     4    19      1 HS           Male      
 9     6     6     3     4     5    19      1 Below HS     Male      
10     6     5     6     2     1    17      2 <NA>         Female    
# ℹ 2,790 more rows

6. mutate()

mutate() is also super useful even when you aren’t grouping

We could also just overwrite it:

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  mutate(gender = plyr::mapvalues(gender, c(1,2), c("Male", "Female")))
# A tibble: 2,800 × 8
      C1    C2    C3    C4    C5   age gender education   
   <int> <int> <int> <int> <int> <int> <chr>  <chr>       
 1     2     3     3     4     4    16 Male   <NA>        
 2     5     4     4     3     4    18 Female <NA>        
 3     4     5     4     2     5    17 Female <NA>        
 4     4     4     3     5     5    17 Female <NA>        
 5     4     4     5     3     2    17 Male   <NA>        
 6     6     6     6     1     3    21 Female Some College
 7     5     4     4     2     3    18 Male   <NA>        
 8     3     2     4     2     4    19 Male   HS          
 9     6     6     3     4     5    19 Male   Below HS    
10     6     5     6     2     1    17 Female <NA>        
# ℹ 2,790 more rows

7. summarize() / summarise()

  • summarize() is one of your “apply” functions
  • The resulting data frame will have the same number of rows as your grouping variable
  • You number of groups is 1 for ungrouped data frames
# group_by() education
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  group_by(education) %>% 
  summarize(age_by_edu = mean(age, na.rm = T))  
# A tibble: 6 × 2
  education     age_by_edu
  <chr>              <dbl>
1 Below HS            25.1
2 College             33.0
3 HS                  31.5
4 Higher Degree       35.3
5 Some College        27.2
6 <NA>                18.0

7. summarize() / summarise()

  • summarize() is one of your “apply” functions
  • The resulting data frame will have the same number of rows as your grouping variable
  • You number of groups is 1 for ungrouped data frames
# no groups  
bfi %>% 
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  summarize(age_by_edu = mean(age, na.rm = T))  
# A tibble: 1 × 1
  age_by_edu
       <dbl>
1       28.8